This tutorial has two major aims: The first one is to show the general workflow of how land cover classifications (or similar tasks) based on satellite data can be performed in R using machine learning algorithms. The second important aim is to show how to assess the area to which a spatial prediction model can be applied (“Area of applicability”, AOA). This is relevant because in spatial predictive mapping, models are often applied to make predictions far beyond sampling locations (i.e. field observarions used to map a variable even on a global scale), where new locations might considerably differ in their environmental properties. However, areas in the predictor space without support of training data are problematic. The model has no knowledge about these environments and predictions for such areas have to be considered highly uncertain.
The example prediction task is to perfom a supervised land cover classification for the Münster in Germany. The dataset to do this includes selected spectral channels of a Sentinel-2 scene. As resposne (reference/ground truth) we use digitized polygons that were created on the basis of expert knowledge.
For this tutorial we need the terra package for processing of the satellite data as well as the caret package as a wrapper for machine learning (here: randomForest) algorithms. Sf is used for handling of the training data available as vector data (polygons). CAST will be used to account for spatial dependencies during model validation as well as for the estimation of the AOA.
rm(list=ls())
#major required packages:
library(terra)
library(sf)
library(caret)
library(mapview)
library(CAST)
library(tmap)
To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.
sen_ms <- rast("data/sen_muenster.tif")
print(sen_ms)
## class : SpatRaster
## dimensions : 664, 798, 7 (nrow, ncol, nlyr)
## resolution : 10, 10 (x, y)
## extent : 400620, 408600, 5753560, 5760200 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=32 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## source : sen_muenster.tif
## names : B02, B03, B04, B08, B06, B07, ...
## min values : 37.5, 211.5, 1.0, 1, 313.1562, 279.8438, ...
## max values : 14582.5, 15787.5, 18767.5, 19929, 10972.9688, 11645.3438, ...
The multi-band raster contains a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). Let’s plot the multi-band raster to get an idea how the variables look like.
plot(sen_ms)
### true color and false color composites
plotRGB(sen_ms,r=3,g=2,b=1,stretch="lin")
plotRGB(sen_ms,r=4,g=3,b=2,stretch="lin")
As an additional predictor variable, we will calculate the NDVI and directly add it as a new raster band
sen_ms$NDVI <- (sen_ms$B08-sen_ms$B04)/(sen_ms$B08+sen_ms$B04)
plot(sen_ms$NDVI)
The vector file is read as sf object. It contains the training sites of 7 Land cover classes. These are polygons (33 in total) that were digitized in QGIS on the basis of the Sentinel data and with support of an aerial image and using expert knowledge. They can be regarded here as a ground truth for the land cover classification.
trainSites <- read_sf("data/trainingsites_muenster.gpkg")
print(trainSites)
## Simple feature collection with 33 features and 3 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 400682 ymin: 5753925 xmax: 408327.9 ymax: 5760113
## Projected CRS: WGS 84 / UTM zone 32N
## # A tibble: 33 × 4
## ClassID Label PolygonID geom
## <dbl> <chr> <int> <POLYGON [m]>
## 1 11 Planted field 1 ((402789.2 5759326, 402850.9 5759266, 402901…
## 2 11 Planted field 2 ((403154.2 5759459, 403357.6 5759470, 403373…
## 3 11 Planted field 3 ((403124.5 5759788, 403314.6 5759806, 403397…
## 4 14 Inland water 4 ((404809.3 5757129, 404865.6 5757163, 404865…
## 5 14 Inland water 5 ((403729 5756102, 403797.5 5756077, 403770.6…
## 6 14 Inland water 6 ((408139 5759018, 408201.5 5759011, 408189.7…
## 7 14 Inland water 7 ((406992.5 5756470, 407110.9 5756441, 407184…
## 8 14 Inland water 8 ((406334.9 5755042, 406351.1 5755068, 406382…
## 9 4 Grassland 9 ((403302.3 5755592, 403332.2 5755597, 403337…
## 10 4 Grassland 10 ((405602.6 5760078, 405610.9 5760104, 405628…
## # ℹ 23 more rows
Using mapview’s viewRGB function we can visualize the aerial image channels as true color composite in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.
viewRGB(as(sen_ms,"Raster"), r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery")+
mapview(trainSites)
In order to train a machine learning model between the spectral properties and the land cover class, we first need to create a data frame that contains the predictor variables at the location of the training sites as well as the corresponding class information. This data frame can be produced with the extract function. However, first we randomly select pixels within the polygons, otherwise the training data set would be too large.
trainSites$PolygonID <- 1:nrow(trainSites) # lets keep the polygon ID
samplepoints <- st_sample(trainSites,800)
samplepoints <- st_intersection(trainSites,samplepoints)
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
Now we can extract and combine with the attributes from the reference sites.
trainDat <- extract(sen_ms, samplepoints, na.rm=FALSE)
trainDat <- cbind(trainDat, samplepoints)
head(trainDat)
## ID B02 B03 B04 B08 B06 B07 B11 NDVI ClassID
## 1 1 889 767 478 3114.0 2296.719 3177.531 1152.000 0.73385301 11
## 2 2 883 709 460 3412.0 2377.031 3473.312 1162.844 0.76239669 11
## 3 3 904 793 536 2530.5 2122.375 2615.500 1620.281 0.65041578 11
## 4 4 835 655 381 3240.0 2201.469 3221.531 1011.875 0.78956089 11
## 5 5 2137 2163 2442 2634.0 2537.219 2670.250 2543.500 0.03782506 4
## 6 6 912 828 463 4196.0 3148.594 4151.719 1681.750 0.80124490 4
## Label PolygonID geom
## 1 Planted field 25 POINT (402009.7 5755294)
## 2 Planted field 25 POINT (402029.9 5755504)
## 3 Planted field 2 POINT (403300.6 5759444)
## 4 Planted field 25 POINT (401946.7 5755365)
## 5 Industrial 15 POINT (406970.9 5754332)
## 6 Grassland 29 POINT (401575.1 5754460)
boxplot(NDVI~Label,data=trainDat)
featurePlot(x = trainDat[, c("B02","B04","B08","NDVI","B11")],
y = factor(trainDat$Label),
plot = "pairs",
## Add a key at the top
auto.key = list(columns = 3))
For model training we need to define the predictor and response variables. As predictors we can use basically all information from the satellite image as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.
predictors <- c("B02","B03","B04","B08","B06","B07","B11","NDVI")
response <- "Label"
We then train a Random Forest model to lean how the classes can be distinguished based on the predictors (note: other algorithms would work as well. See https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret’s train function is doing this job.
# train the model
set.seed(100)
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
importance=TRUE)
print(model)
## Random Forest
##
## 800 samples
## 8 predictor
## 7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 800, 800, 800, 800, 800, 800, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9758167 0.9701738
## 5 0.9717264 0.9651267
## 8 0.9674102 0.9598195
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(model))
To perform the classification we can use the trained model and apply it to each pixel of the raster stack using the predict function. Then we can then create a map with meaningful colors of the predicted land cover using the tmap package.
prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")
tm_shape(prediction) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
The code shown above was a basic example of how to train a model and do a classification. But there is more to it. We have to deal with spatial autocorrelation of the data for model tuning and validation and possibly also during a varible selection.
Before starting model training we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial 3-fold cross-validation. Therefore the training data are split into 3 folds but data from the same polygon are always grouped so that they never occur in both, training and testing. Also we make sure that each fold contains data from each land cover class. CAST’s CreateSpacetimeFolds is doing this job when we specify the polygon ID and the class label.
indices <- CreateSpacetimeFolds(trainDat,spacevar = "PolygonID",k=3,class="Label")
ctrl <- trainControl(method="cv",
index = indices$index,
savePredictions = TRUE)
We can visualize the “representativness” of the CV folds with plot_geodist:
geodist <- plot_geodist(samplepoints,sen_ms,
cvfolds = indices$indexOut,
cvtrain = indices$index)
See how it compares to a random CV:
randomfolds <- createFolds(samplepoints$ClassID, k = 3, list = TRUE, returnTrain = FALSE)
geodist <- plot_geodist(samplepoints,sen_ms,
cvfolds = randomfolds)
We can also explore how it would look like using NNDM
# Define the modeldomain (here the extent of the satellite image)
modeldomain <- st_as_sf(st_as_sfc(st_bbox(sen_ms)))
modeldomain <- st_transform(modeldomain,st_crs(samplepoints))
#Define folds:
nndmfolds <- knndm(samplepoints,modeldomain=modeldomain)
#plot
geodist <- plot_geodist(samplepoints,sen_ms,
cvfolds = nndmfolds$indx_test)
Model training is performed using caret’s train function. However if we want to test, which predictor variables are relevant for making predictions on new spatial locations, we can use the forward feature selection (ffs) function from the CAST package, which is a wrapper around the train function. We specify “rf” as method, indicating that a Random Forest is applied. For model training we reduce the number of trees (ntree) to 75 to speed things up. Note that usually a larger number (>250) is appropriate. We use the Kappa index for validation.
model <- ffs(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl,
importance=TRUE,
ntree=75)
plot_ffs(model)
plot_ffs(model,plotType = "selected")
print(model)
## Random Forest
##
## 800 samples
## 5 predictor
## 7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 507, 469, 624
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9718745 0.9637398
## 3 0.9675851 0.9580804
## 5 0.9668393 0.9572367
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")
tm_shape(prediction) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
When we print the model (see above) we get a summary of the prediction performance as the average Kappa and Accuracy of the three spatial folds. Looking at all cross-validated predictions together we can get the “global” model performance.
# get all cross-validated predictions:
cvPredictions <- model$pred[model$pred$mtry==model$bestTune$mtry,]
# calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
##
## Fallow field Grassland Industrial Inland water Mixed forest
## Fallow field 108 0 1 0 0
## Grassland 0 31 0 0 0
## Industrial 1 0 53 0 0
## Inland water 0 0 0 86 0
## Mixed forest 0 0 0 0 188
## Planted field 0 0 0 0 0
## Urban 1 0 2 0 1
##
## Planted field Urban
## Fallow field 0 7
## Grassland 4 0
## Industrial 0 1
## Inland water 0 0
## Mixed forest 2 5
## Planted field 100 0
## Urban 0 209
We have seen that technically, the trained model can be applied to the entire area of interest (and beyond…as long as the sentinel predictors are available which they are, even globally). But we should assess if we SHOULD apply our model to the entire area. The model should only be applied to locations that feature predictor properties that are comparable to those of the training data. If dissimilarity to the training data is larger than the disimmilarity within the training data, the model should not be applied to this location.
The calculation of the AOA is quite time consuming. To make a bit faster we use a parallelization.
AOA <- aoa(sen_ms,model)
## No trainDI provided. Computing DI of training data...
## Computing DI of newdata...
## Computing AOA...
plot(AOA)
plot(AOA$DI)
plot(AOA$AOA)
The result of the aoa function has two layers: the dissimilarity index (DI) and the area of applicability (AOA). The DI can take values from 0 to Inf, where 0 means that a location has predictor properties that are identical to properties observed in the training data. With increasing values the dissimilarity increases. The AOA has only two values: 0 and 1. 0 means that a location is outside the area of applicability, 1 means that the model is inside the area of applicability. Find more information on how the AOA is derived in Meyer&Pebesma (2020).
The figure above shows the predictions ONLY for the AOA (Locations outside the AOA are shown in grey). We see that the model can be applied to most parts of Münster, however there are some locations (especially in the south-west) that are too different in their predictor properties so that we should exclude those predictions from our prediction result.
For further help on handling of raster and vector data in R see e.g. https://geocompr.github.io/. More information on machine learning: e.g. https://link.springer.com/book/10.1007/978-1-4614-6849-3. More information on the relevance of spatial validation strategies as well on the AOA can be found in recordings from OpenGeoHub (https://www.uni-muenster.de/RemoteSensing/lehre/summer_schools/) or in the Tutorials of the CAST package: https://hannameyer.github.io/CAST/